A phylogenetic tree (also phylogeny or evolutionary tree) is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common ancestor of those descendants, and the edge lengths in some trees may be interpreted as time estimates. Unrooted trees illustrate only the relatedness of the leaf nodes and do not require the ancestral root to be known or inferred.
This is a phylogenetic tree based on rRNA genes, showing the three life domains: bacteria, archaea, and eukaryota. The black branch at the bottom of the phylogenetic tree connects the three branches of living organisms to the last universal common ancestor.
In biology, phenetics, also known as taximetrics, is an attempt to classify organisms based on overall similarity, usually in morphology or other observable traits, regardless of their phylogeny or evolutionary relation.
Cladistics is an approach to biological classification in which organisms are categorized in groups (“clades”) based on hypotheses of most recent common ancestry. The evidence for hypothesized relationships is typically shared derived characteristics (synapomorphies) that are not present in more distant groups and ancestors. Theoretically, a common ancestor and all its descendants are part of the clade; from an empirical perspective, however, common ancestors are inferences based on a cladistic hypothesis of relationships of taxa whose character states can be observed. Cladistics is now the most commonly used method to classify organisms.
In cladistics or phylogenetics, an outgroup is a more distantly related group of organisms that serves as a reference group when determining the evolutionary relationships of the ingroup, the set of organisms under study (the term is unrelated to sociological outgroups). The outgroup is used as a point of comparison for the ingroup and specifically allows the phylogeny to be rooted. Because the polarity (direction) of character change can be determined only on a rooted phylogeny, the choice of outgroup is essential for understanding the evolution of traits along a phylogeny.
Terminal nodes or tips typically represent extant organisms, also frequently called operational taxonomic units or OTUs. OTU is a generic way of referring to a grouping of organisms (such as a species, a genus, or a phylum), without specifically identifying what that grouping is.
Internal nodes in a phylogenetic tree represent hypothetical ancestors. We postulate their existence but often don’t have direct evidence. The root node is the internal node from which all other nodes in the tree descend. This is often referred to as the last common ancestor (LCA) of the OTUs represented in the tree. In a universal tree of life, the LCA is often referred to as LUCA, the last universal common ancestor. All nodes in the tree can be referred to as OTUs.
Branches connect the nodes in the tree, and generally represent time or some amount of evolutionary change between the OTUs. The specific meaning of the branches will be dependent on the method that was used to build the phylogenetic tree.
A clade in a tree refers to some node (either internal or terminal) and all nodes descending from it (i.e., moving away from the root toward the tips).
We’re using the DNA sequences for a subset of different influenza strains collected from 1993 to 2008 in the US. The full data set and annotation can be found online.
library("adegenet")
library("ape")
library("phangorn")
dna <- fasta2DNAbin(file="course/data/usflu_lite.fasta", quiet=TRUE)
annot <- read.csv("course/data/usflu.annot_lite.csv", header=TRUE, row.names=1)
dna
9 DNA sequences in binary format stored in a matrix.
All sequences of same length: 1701
Labels:
CY012160
CY011528
CY010028
CY006627
CY003785
CY000185
...
Base composition:
a c g t
0.335 0.200 0.225 0.239
(Total: 15.31 kb)
head(annot)
accession year misc
1 CY012160 1993 (A/New York/762/1993(H3N2))
2 CY011528 1995 (A/New York/669/1995(H3N2))
3 CY010028 1996 (A/New York/591/1996(H3N2))
4 CY006627 1997 (A/New York/547/1997(H3N2))
5 CY003785 1999 (A/New York/422/1999(H3N2))
6 CY000185 2001 (A/New York/83/2001(H3N2))
There are many approaches to phylogenetic reconstruction. If an explicit model of evolution is assumed, maximum likelihood or Bayesian methods can be used for character-based data, and distance-based methods for non-character-based data. If no explicit model of evolution is assumed, maximum parsimony methods can be used for character-based data. This document focuses mainly on distance-based methods.
Distance-based trees are produced by calculating the genetic distances between pairs of taxa, followed by hierarchical clustering that builds the tree itself. While there are many algorithms for computing distances, two clustering methods are used most frequently: neighbor joining and UPGMA.
D <- dist.dna(dna, model = "TN93")
temp <- as.data.frame(as.matrix(D))
table.paint(temp, cleg=0, clabel.row=.5, clabel.col=.5)
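To make the idea of a pairwise genetic distance concrete, here is a minimal base-R sketch on a toy alignment. It computes the raw proportion of differing sites (the p-distance) and then applies the Jukes-Cantor (JC69) correction, which is a simpler model than the TN93 model used above. The sequences and the names `seqs`, `p_dist`, and `jc69` are illustrative assumptions and are not part of the influenza data set.

```r
# Toy alignment: three sequences of equal length (hypothetical data).
seqs <- c(s1 = "ACGTACGTAC",
          s2 = "ACGTACGTAT",
          s3 = "ACGAACGTTT")

# Split each sequence into a vector of single nucleotides.
chars <- strsplit(seqs, "")

# p-distance: proportion of sites at which two sequences differ.
p_dist <- function(a, b) mean(a != b)

# Jukes-Cantor correction for multiple substitutions at the same site:
# d = -3/4 * log(1 - 4/3 * p)
jc69 <- function(p) -0.75 * log(1 - 4 * p / 3)

# Fill the full pairwise matrix of p-distances.
n <- length(chars)
P <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    P[i, j] <- p_dist(chars[[i]], chars[[j]])
  }
}

# Corrected distances are never smaller than the raw proportions.
D_jc <- jc69(P)

print(round(P, 3))
print(round(D_jc, 3))
```

The corrected distances grow faster than the raw proportions as sequences diverge, which is the reason a substitution model is usually preferred over raw counts.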
The neighbor-joining algorithm takes as input a distance matrix specifying the distance between each pair of taxa. It starts with a completely unresolved tree, whose topology corresponds to that of a star network, and iteratively joins pairs of nodes until the tree is completely resolved and all branch lengths are known.
tre <- nj(D)
tre <- ladderize(tre)
plot(tre, show.tip=FALSE, type="phylogram")
title("Unrooted NJ tree")
myPal <- colorRampPalette(c("red","yellow","green","blue"))
tiplabels(annot$year, bg=num2col(annot$year, col.pal=myPal), cex=0.8)
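A single neighbor-joining iteration can be sketched in base R as follows. This is only an illustration of the standard criterion, not the implementation inside `nj()`: the toy matrix `D_toy` and the helper `nj_step` are hypothetical names, and the sketch assumes a symmetric distance matrix with at least three taxa.

```r
# Toy distance matrix for five taxa (hypothetical values, not the flu data).
D_toy <- matrix(c(0,  5,  9,  9, 8,
                  5,  0, 10, 10, 9,
                  9, 10,  0,  8, 7,
                  9, 10,  8,  0, 3,
                  8,  9,  7,  3, 0),
                nrow = 5, byrow = TRUE,
                dimnames = list(letters[1:5], letters[1:5]))

# One neighbor-joining iteration: choose a pair, compute its branch
# lengths, and return the reduced distance matrix.
nj_step <- function(D) {
  n <- nrow(D)
  r <- rowSums(D)
  # Q(i, j) = (n - 2) * d(i, j) - r_i - r_j
  Q <- (n - 2) * D - outer(r, r, "+")
  Q[lower.tri(Q, diag = TRUE)] <- Inf   # consider each pair once
  idx <- which(Q == min(Q), arr.ind = TRUE)[1, ]
  f <- idx[["row"]]
  g <- idx[["col"]]
  # Branch lengths from the joined taxa to the new internal node.
  len_f <- unname(D[f, g] / 2 + (r[f] - r[g]) / (2 * (n - 2)))
  len_g <- D[f, g] - len_f
  # Distances from the new node to every remaining taxon.
  others <- setdiff(seq_len(n), c(f, g))
  d_new <- (D[f, others] + D[g, others] - D[f, g]) / 2
  new_name <- paste0("(", rownames(D)[f], ",", rownames(D)[g], ")")
  D_new <- rbind(c(0, d_new),
                 cbind(d_new, D[others, others, drop = FALSE]))
  dimnames(D_new) <- rep(list(c(new_name, rownames(D)[others])), 2)
  list(joined = c(rownames(D)[f], rownames(D)[g]),
       branch_lengths = c(len_f, len_g),
       D = D_new)
}

res <- nj_step(D_toy)
print(res$joined)
print(res$branch_lengths)
print(res$D)
```

Calling `nj_step` again on `res$D` would perform the next join; repeating this until three nodes remain resolves the whole tree.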
We can also root the tree using the virus that was isolated first.
tre2 <- root(tre, out = 1)
tre2 <- ladderize(tre2)
plot(tre2, show.tip=FALSE, edge.width=2)
title("Rooted NJ tree")
tiplabels(tre2$tip.label, bg=transp(num2col(annot$year, col.pal=myPal),.7), cex=.6, fg="transparent", adj = c(-0.1, 0.5))
axisPhylo()
temp <- pretty(1993:2008, 5)
legend("topright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2)
Because there are so many different algorithms to choose from when constructing a tree, we should check that the one we chose is appropriate for our dataset by comparing the pairwise distances implied by the tree with our original distance matrix (in this case, D). This is much easier than it sounds, and just requires some plots and correlation calculations.
x <- as.vector(D)
y <- as.vector(as.dist(cophenetic(tre2)))
plot(x, y, xlab="original pairwise distances", ylab="pairwise distances on the tree",
main="Is NJ appropriate?", pch=20, col=transp("black",.1), cex=3)
abline(lm(y~x), col="red")
tre3 <- upgma(D)
plot(tre3, show.tip=FALSE, type="phylogram")
title("UPGMA")
myPal <- colorRampPalette(c("red","yellow","green","blue"))
tiplabels(annot$year, bg=num2col(annot$year, col.pal=myPal), cex=0.8)
UPGMA produces an ultrametric tree, meaning that all isolates are assumed to have undergone the same amount of evolution (which is usually not true, especially here, where the isolates come from several different years). We can test this assumption.
x <- as.vector(D)
y <- as.vector(as.dist(cophenetic(tre3)))
plot(x, y, xlab="original pairwise distances", ylab="pairwise distances on the tree",
main="Is UPGMA appropriate?", pch=20, col=transp("black",.1), cex=3)
abline(lm(y~x), col="red")
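The visual check in these plots can be summarized by a single number: the squared correlation between the original distances and the distances implied by the tree. The sketch below illustrates this on two toy matrices, using base R's `hclust(method = "average")` as a stand-in for `upgma()`. The matrices and the helper `tree_fit_r2` are hypothetical and are not derived from the influenza data.

```r
# Squared correlation between original distances and the distances implied
# by an average-linkage (UPGMA-style) clustering of them.
tree_fit_r2 <- function(m) {
  d <- as.dist(m)
  hc <- hclust(d, method = "average")
  x <- as.vector(d)               # original pairwise distances
  y <- as.vector(cophenetic(hc))  # pairwise distances on the tree
  cor(x, y)^2
}

# Clock-like (ultrametric) toy distances: every tip is equally far from the root.
m_clock <- matrix(c( 0,  2,  6, 10,
                     2,  0,  6, 10,
                     6,  6,  0, 10,
                    10, 10, 10,  0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("t", 1:4), paste0("t", 1:4)))

# Toy distances that are not clock-like: lineages evolved by different amounts.
m_uneven <- matrix(c(0,  5,  9,  9, 8,
                     5,  0, 10, 10, 9,
                     9, 10,  0,  8, 7,
                     9, 10,  8,  0, 3,
                     8,  9,  7,  3, 0),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(letters[1:5], letters[1:5]))

r2_clock  <- tree_fit_r2(m_clock)
r2_uneven <- tree_fit_r2(m_uneven)
print(c(clock = r2_clock, uneven = r2_uneven))
```

For the objects in this document, the analogous quantity would be `cor(x, y)^2` computed right after each plot; a value well below 1 signals that the tree distorts the original distances.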
Obviously, UPGMA is not appropriate for this data set.
Phylogenetic reconstruction based on parsimony seeks trees that minimize the total number of changes (substitutions) from ancestors to descendants. While this approach is open to a number of criticisms, it is a simple way to infer phylogenies for data that display low divergence (i.e. most taxa differ from each other by only a few nucleotides, and the overall substitution rate is low).
In practice, there is often no way to perform an exhaustive search amongst all possible trees to find the most parsimonious one, and heuristic algorithms are used to browse the space of possible trees. The strategy is fairly simple: i) initialize the algorithm using a tree and ii) make small changes to the tree and retain those leading to better parsimony, until the parsimony score stops improving.
dna2 <- as.phyDat(dna)
tre.ini <- nj(dist.dna(dna, model="raw"))
parsimony(tre.ini, dna2)
[1] 184
tre.pars <- optim.parsimony(tre.ini, dna2)
Final p-score 184 after 0 nni operations
parsimony(tre.pars, dna2)
[1] 184
plot(tre.pars, type="unrooted", show.tip=FALSE, edge.width=2)
title("Maximum-parsimony tree")
tiplabels(annot$year, bg=transp(num2col(annot$year, col.pal=myPal),.7), cex=.5, fg="transparent")
temp <- pretty(1993:2008, 5)
legend("bottomright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2, bg=transp("white"))
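To illustrate what a parsimony score counts, the following base-R sketch applies Fitch's algorithm to a fixed rooted binary tree and sums the minimum number of changes over the columns of a toy alignment. The nested-list tree, the alignment `aln`, and the function `fitch` are hypothetical illustrations. They show the kind of score that `parsimony()` reports, not phangorn's actual implementation.

```r
# A rooted binary tree as nested lists; tips are taxon names (hypothetical).
tree <- list(list("t1", "t2"), list("t3", list("t4", "t5")))

# Toy alignment: rows are taxa, columns are sites (hypothetical data).
aln <- rbind(t1 = c("A", "C", "T"),
             t2 = c("A", "C", "T"),
             t3 = c("G", "C", "T"),
             t4 = c("G", "C", "A"),
             t5 = c("A", "C", "A"))

# Fitch's bottom-up pass for one site: returns the candidate state set at a
# node and the minimum number of changes needed in the subtree below it.
fitch <- function(node, states) {
  if (is.character(node)) {
    return(list(set = states[[node]], cost = 0))
  }
  left  <- fitch(node[[1]], states)
  right <- fitch(node[[2]], states)
  common <- intersect(left$set, right$set)
  if (length(common) > 0) {
    # The children agree on at least one state: no extra change is needed.
    list(set = common, cost = left$cost + right$cost)
  } else {
    # The children disagree: one change is required on this part of the tree.
    list(set = union(left$set, right$set), cost = left$cost + right$cost + 1)
  }
}

# Parsimony score of the tree: sum of per-site minimum changes.
site_costs <- sapply(seq_len(ncol(aln)), function(j) fitch(tree, aln[, j])$cost)
p_score <- sum(site_costs)
print(site_costs)
print(p_score)
```

A heuristic search would rearrange `tree`, recompute `p_score`, and keep a rearrangement only if the score decreases.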
Another way to validate phylogenetic trees, and a much more commonly used one, is the bootstrap. Bootstrapping a phylogeny consists of sampling the nucleotide sites with replacement, rebuilding the phylogeny, and checking whether the original nodes are present in the bootstrapped trees. In practice, this procedure is repeated a large number of times (e.g. 100 or 1000), depending on how computer-intensive the phylogenetic reconstruction is. The underlying idea is to assess the variability in the obtained topology that results from conducting the analyses on a random sample of the genome. Note that the assumption that the analysed sequences represent a random sample of the genome is often dubious.
myBoots <- boot.phylo(tre2, dna, function(e) root(nj(dist.dna(e, model = "TN93")),1))
Running bootstraps: 100 / 100
Calculating bootstrap values... done.
plot(tre2, show.tip=FALSE, edge.width=2)
title("NJ tree + bootstrap values")
tiplabels(tre2$tip.label, bg=transp(num2col(annot$year, col.pal=myPal),.7), cex=.6, fg="transparent", adj = c(-0.1, 0.5))
axisPhylo()
temp <- pretty(1993:2008, 5)
legend("topright", fill=transp(num2col(temp, col.pal=myPal),.7), leg=temp, ncol=2)
nodelabels(myBoots, cex=.6)
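The resampling idea behind the bootstrap values can be sketched in base R on a toy alignment. Alignment columns are drawn with replacement, a tree is rebuilt from each resampled alignment, and we record how often a grouping of interest reappears. The sketch uses p-distances and `hclust(method = "average")` as simple stand-ins for `dist.dna()` and `nj()`; the alignment and the helpers `p_dist_matrix` and `has_pair` are hypothetical.

```r
set.seed(1)  # make the resampling reproducible

# Toy alignment: rows are taxa, columns are sites (hypothetical data).
aln <- do.call(rbind, strsplit(c(t1 = "ACGTACGTAC",
                                 t2 = "ACGTACGTAT",
                                 t3 = "ACGAACTTTT",
                                 t4 = "TCGAAGTTTT"), ""))

# Pairwise proportion of differing sites, returned as a "dist" object.
p_dist_matrix <- function(a) {
  n <- nrow(a)
  d <- matrix(0, n, n, dimnames = list(rownames(a), rownames(a)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- mean(a[i, ] != a[j, ])
    }
  }
  as.dist(d)
}

# Does the tree built from alignment `a` join taxa 1 and 2 directly?
# In hclust's merge matrix, negative entries denote single observations.
has_pair <- function(a) {
  hc <- hclust(p_dist_matrix(a), method = "average")
  any(apply(hc$merge, 1, function(r) setequal(r, c(-1, -2))))
}

# Bootstrap: resample columns with replacement and rebuild the tree.
n_boot <- 200
hits <- replicate(n_boot, {
  cols <- sample(ncol(aln), replace = TRUE)
  has_pair(aln[, cols, drop = FALSE])
})

# Proportion of replicates in which the (t1, t2) grouping is recovered.
support <- mean(hits)
print(has_pair(aln))
print(support)
```

The `support` value plays the same role for the (t1, t2) grouping as the node labels plotted above, which are counts of bootstrap trees rather than proportions.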
Because the vertical dimension has no meaning, trees can be displayed in different forms.
Please cite the original authors if you use any code/text/figures from this document.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] phangorn_2.5.5 ape_5.4-1 adegenet_2.1.3 ade4_1.7-15 rmarkdown_2.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 deldir_0.2-10 class_7.3-17
[5] gtools_3.8.2 digest_0.6.25 mime_0.9 R6_2.4.1
[9] plyr_1.8.6 evaluate_0.14 coda_0.19-4 e1071_1.7-4
[13] ggplot2_3.3.2 pillar_1.4.6 rlang_0.4.7 spdep_1.1-5
[17] gdata_2.18.0 vegan_2.5-7 raster_3.4-5 gmodels_2.18.1
[21] Matrix_1.2-18 splines_4.0.2 stringr_1.4.0 igraph_1.2.5
[25] munsell_0.5.0 shiny_1.5.0 compiler_4.0.2 httpuv_1.5.4
[29] xfun_0.17 pkgconfig_2.0.3 mgcv_1.8-33 htmltools_0.5.0
[33] tidyselect_1.1.0 tibble_3.0.3 expm_0.999-6 quadprog_1.5-8
[37] codetools_0.2-16 permute_0.9-5 crayon_1.3.4 dplyr_1.0.2
[41] later_1.1.0.1 sf_0.9-7 MASS_7.3-52 grid_4.0.2
[45] nlme_3.1-149 spData_0.3.8 xtable_1.8-4 gtable_0.3.0
[49] lifecycle_0.2.0 DBI_1.1.0 magrittr_1.5 units_0.6-7
[53] scales_1.1.1 KernSmooth_2.23-17 stringi_1.4.6 reshape2_1.4.4
[57] LearnBayes_2.15.1 promises_1.1.1 sp_1.4-2 seqinr_4.2-5
[61] ellipsis_0.3.1 generics_0.0.2 vctrs_0.3.4 fastmatch_1.1-0
[65] boot_1.3-25 tools_4.0.2 glue_1.4.2 purrr_0.3.4
[69] parallel_4.0.2 fastmap_1.0.1 yaml_2.2.1 colorspace_1.4-1
[73] cluster_2.1.0 classInt_0.4-3 knitr_1.30